Search results: All records where Creators/Authors contains "Weerasinghe, Janith"

Note: Clicking on a Digital Object Identifier (DOI) number will take you to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the publisher's embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Recently, there have been significant advances in, and wide-scale use of, generative AI for natural language generation. Models such as OpenAI's GPT-3 and Meta's LLaMA are widely used in chatbots, for summarizing documents, and for generating creative content. These advances raise concerns about abuse of these models, especially in social media settings, such as large-scale generation of disinformation, manipulation campaigns that use AI-generated content, and personalized scams. We used stylometry (the analysis of style in natural language text) to analyze the style of AI-generated text. Specifically, we applied an existing authorship verification (AV) model, which predicts whether two documents were written by the same author, to texts generated by GPT-2, GPT-3, ChatGPT, and LLaMA. Our AV model was trained only on human-written text and has been used effectively in social media settings to analyze cases of abuse. We generated texts by providing the language models with fanfiction snippets and prompting them to complete the rest in the same writing style as the original snippet. We then applied the AV model across the model-generated and human-written texts to analyze the similarity of their writing styles. We found that texts generated with GPT-2 had the highest similarity to the human texts. Texts generated by GPT-3 and ChatGPT were very different from the human snippets, and were similar to each other. LLaMA-generated texts had some similarity to the original snippets but also had similarities with other LLaMA-generated texts and with texts from the other models. We then conducted a feature analysis to identify the features that drive these similarity scores. This analysis helped us answer questions such as which features distinguish the language style of language models from that of humans, which features differ across models, and how these linguistic features change across language model versions. The dataset and the source code used in this analysis have been made public to allow further analysis of new language models.
    Free, publicly-accessible full text available February 25, 2026
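    As a rough illustration of this kind of pairwise style comparison, the sketch below averages similarity scores within and between text sources. The paper's trained AV model is not reproduced here; a character 3-gram TF-IDF cosine score stands in for its output, and the snippets are invented, so both are assumptions for illustration only.

        # Sketch of a pairwise style-similarity analysis across text sources.
        # A character 3-gram TF-IDF cosine score stands in for the paper's
        # trained authorship-verification model (an assumption, not their model).
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Toy corpora keyed by source; the study used fanfiction snippets and
        # model-generated completions of those snippets.
        texts = {
            "human":   ["She walked along the shore, counting the gulls overhead."],
            "gpt2":    ["She walked along the shore and the gulls cried overhead."],
            "gpt3":    ["The protagonist strolled by the sea, observing the birds."],
            "chatgpt": ["As she strolled by the sea, she observed the circling birds."],
            "llama":   ["She wandered the shoreline while gulls wheeled above her."],
        }

        docs = [doc for corpus in texts.values() for doc in corpus]
        labels = [src for src, corpus in texts.items() for _ in corpus]
        vectors = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(docs)
        sims = cosine_similarity(vectors)

        # Average the similarity scores within and between sources.
        sources = list(texts)
        for i, a in enumerate(sources):
            for b in sources[i:]:
                rows = [k for k, lab in enumerate(labels) if lab == a]
                cols = [k for k, lab in enumerate(labels) if lab == b]
                print(f"{a:8s} vs {b:8s}: {sims[np.ix_(rows, cols)].mean():.3f}")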
  2. Cappellato, Linda; Eickhoff, Carsten; Ferro, Nicola; Névéol, Aurélie (Eds.)
    This paper describes the approach we took to create a machine learning model for the PAN 2020 Authorship Verification Task. For each document pair, we extracted stylometric features from the documents and used the absolute difference between the feature vectors as input to our classifier. We created two models: a Logistic Regression model trained on the small dataset, and a Neural Network-based model trained on the large dataset. These models achieved AUCs of 0.939 and 0.953 on the small and large datasets, respectively, making them the second-best of the models submitted to the shared task on both datasets.
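    The pairing scheme described above is simple to reproduce in outline. The sketch below builds a per-document feature vector, takes the absolute elementwise difference for each pair, and fits a logistic regression on those differences; the four toy features and the example pairs are assumptions for illustration, not the submission's actual feature set.

        # Sketch of the described setup: stylometric features per document,
        # absolute difference per pair, logistic regression on the differences.
        # The toy features below are illustrative, not the PAN 2020 feature set.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def stylo_features(doc):
            words = doc.split()
            return np.array([
                np.mean([len(w) for w in words]),   # mean word length
                len(set(words)) / len(words),       # type-token ratio
                doc.count(",") / len(words),        # comma rate
                doc.count(";") / len(words),        # semicolon rate
            ])

        def pair_vector(doc_a, doc_b):
            return np.abs(stylo_features(doc_a) - stylo_features(doc_b))

        # Pairs of documents; y = 1 for same author, 0 for different authors.
        pairs = [("I think, therefore I am.", "I doubt; therefore I think."),
                 ("Call me Ishmael.", "It was a dark and stormy night.")]
        y = [1, 0]
        X = np.stack([pair_vector(a, b) for a, b in pairs])
        clf = LogisticRegression().fit(X, y)
        print(clf.predict_proba(X)[:, 1])  # estimated P(same author) per pair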
  3. Previous studies, in both psychology and linguistics, have shown that individuals with mental illnesses exhibit deviations from normal language use, and that these differences can be used to make predictions and serve as a diagnostic tool. Recent studies have shown that machine learning can be used to predict whether people have mental illnesses based on their writing. However, little attention has been paid to the interpretability of these machine learning models. In this talk we describe our analysis of the machine learning models, the language patterns that distinguish individuals with mental illnesses from a control group, and the associated privacy concerns. We use a dataset of Tweets collected from users who reported a diagnosis of a mental illness on Twitter. Given the self-reported nature of the dataset, it is possible that some of these individuals actively talk about their mental illness on social media. We investigated whether the machine learning models detect active mentions of the mental illness or more complex language patterns. We then conducted a feature analysis by creating feature vectors from word unigrams, part-of-speech tags, and word clusters, and used feature importance measures and statistical methods to identify important features. This analysis serves two purposes: to understand the machine learning model, and to discover language patterns that help identify people with mental illnesses. Finally, we conducted a qualitative analysis of the misclassifications to understand their potential causes.
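    The word-unigram branch of this feature analysis can be outlined with standard tooling: bag-of-words features, a linear classifier, and coefficient magnitudes read as rough importance scores. The toy tweets and labels below are invented, and the part-of-speech and word-cluster features are omitted, so this is an assumption-laden sketch rather than the study's pipeline.

        # Sketch of a word-unigram feature-importance analysis. The toy tweets
        # and labels are invented; the POS-tag and word-cluster features from
        # the study are omitted here.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression

        tweets = ["can't sleep again, everything feels heavy",
                  "great run this morning, coffee after",
                  "another night awake thinking too much",
                  "weekend plans with friends, so excited"]
        labels = [1, 0, 1, 0]  # 1 = self-reported diagnosis group (toy labels)

        vec = CountVectorizer()
        X = vec.fit_transform(tweets)
        clf = LogisticRegression().fit(X, labels)

        # Rank unigrams by coefficient magnitude as a rough importance measure.
        terms = vec.get_feature_names_out()
        order = np.argsort(-np.abs(clf.coef_[0]))
        for i in order[:5]:
            print(terms[i], round(clf.coef_[0][i], 3))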
  4. Online Social Networks (OSNs) use curation algorithms to present relevant content to users. These algorithms can be manipulated by users with various intentions. We investigate common methods used by manipulators as part of a larger project aimed at improving OSN defenses against manipulation.